Goal of the analysis¶

This work is based on the "Individual household electric power consumption" data set from the UCI Machine Learning Repository. Per the description, it contains measurements of electric power consumption in one household, with a one-minute sampling rate, over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.

The goal is to size a hypothetical solar power system for the house so that it minimizes grid reliance in most situations. For simplicity, this preliminary analysis introduces a few rough assumptions: the sun is always shining (day and night) and all the power from the solar panels can be used by the appliances.

To size the system, it is important to model the power consumption of the entire household and understand its trends. Finally, a six-month prediction of the power consumption is provided.

Table of Contents

  • 1  Goal of the analysis
  • 2  Introduction to the dataset
    • 2.1  Read household dataset
    • 2.2  Data types
    • 2.3  Null values
    • 2.4  Create index as date
    • 2.5  Look for missing dates (and minutes for each date) in the date index
    • 2.6  Interpolate missing values and save to parquet for easy retrieval
    • 2.7  Sanity check: null values
  • 3  Exploratory analysis
    • 3.1  Descriptive statistics
    • 3.2  Correlations
    • 3.3  Visualize Global_active_power distribution
    • 3.4  Choice of aggregation function for Global_active_power
    • 3.5  Resampling Global_active_power on a monthly scale
  • 4  Investigate stationarity
    • 4.1  Statistics on different sections of data
    • 4.2  Visual inspection (histogram)
    • 4.3  Augmented Dickey-Fuller test
    • 4.4  Autocorrelation
    • 4.5  Seasonal subseries plot (monthly)
  • 5  Modeling: Additive decomposition of monthly data
    • 5.1  Residual distribution and Augmented Dickey-Fuller test
  • 6  Modeling: Triple Exponential Smoothing
  • 7  Modeling: ARIMA
  • 8  Modeling: SARIMA
    • 8.1  Autocorrelation of seasonal component
    • 8.2  SARIMA fitting
    • 8.3  SARIMA forecast
    • 8.4  SARIMA six months forecast
  • 9  Modeling: Auto ARIMA
    • 9.1  Auto ARIMA fitting
    • 9.2  Auto ARIMA forecast
    • 9.3  Auto ARIMA six months forecast
  • 10  Cross-validation split and model performance evaluation
  • 11  Six-month future forecast from three models
  • 12  Findings (analysis)
  • 13  Findings and suggestions (models)
  • 14  Conclusions

Introduction to the dataset¶

Read household dataset¶

Data types¶

Attribute info (from UCI Machine Learning repository).

 1. date: Date in format dd/mm/yyyy
 2. time: time in format hh:mm:ss
 3. global_active_power: household global minute-averaged active power (in kilowatt)
 4. global_reactive_power: household global minute-averaged reactive power (in kilowatt)
 5. voltage: minute-averaged voltage (in volt)
 6. global_intensity: household global minute-averaged current intensity (in ampere)
 7. sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
 8. sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
 9. sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 8 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   date                   datetime64[ns]
 1   Global_active_power    float64       
 2   Global_reactive_power  float64       
 3   Voltage                float64       
 4   Global_intensity       float64       
 5   Sub_metering_1         float64       
 6   Sub_metering_2         float64       
 7   Sub_metering_3         float64       
dtypes: datetime64[ns](1), float64(7)
memory usage: 126.7 MB
None

Null values¶

Null values are represented by '?' in the original csv file. Approximately 1.25% of the rows have missing values. The missing values are distributed across the dates, with some clustering in the final dates. Depending on the time aggregation (hours, days, weeks, or months), proper treatment of the missing values is required.

Null values per column:
null_values null_value_percent
date 0 0.000000
Global_active_power 25979 1.251844
Global_reactive_power 25979 1.251844
Voltage 25979 1.251844
Global_intensity 25979 1.251844
Sub_metering_1 25979 1.251844
Sub_metering_2 25979 1.251844
Sub_metering_3 25979 1.251844
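The '?' placeholders can be handled directly at load time. A minimal sketch on a hypothetical two-row excerpt of the file (the column subset and values are illustrative, not taken from the real csv):

```python
import io
import pandas as pd

# Hypothetical miniature of the raw file: ';'-separated, '?' marks missing values.
raw = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.84\n"
    "16/12/2006;17:25:00;?;?\n"
)

# na_values='?' converts the placeholder to NaN at load time,
# so the numeric columns get a float dtype instead of object.
df = pd.read_csv(raw, sep=";", na_values="?")
print(df.isna().sum().sum())  # → 2 missing cells
```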

Create index as date¶

Dates in this dataset have temporal resolution of 1 minute.

Number of dates: 2075259, Unique: 2075259
min                   2006-12-16 17:24:00
max                   2010-12-11 23:59:00
data_unique_values                2075259
expected_values                   2097035
dtype: object
No duplicated dates
Original dataset is missing 21776 dates
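The missing-dates count above can be reproduced by comparing the actual index against a complete minute-resolution range between the first and last timestamps. A toy sketch with one deliberately missing minute:

```python
import pandas as pd

# Toy index with a one-minute gap, assuming the full dataset uses a
# similar minute-resolution DatetimeIndex.
idx = pd.to_datetime([
    "2006-12-16 17:24:00",
    "2006-12-16 17:25:00",
    "2006-12-16 17:27:00",  # 17:26 is missing
])

# The expected index covers every minute between min and max;
# the set difference gives the missing timestamps.
expected = pd.date_range(idx.min(), idx.max(), freq="min")
missing = expected.difference(idx)
print(len(expected), len(missing))  # → 4 1
```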

Look for missing dates (and minutes for each date) in the date index¶

No duplicated index dates
missing_minutes
dates
2010-01-12 1440
2010-02-12 1440
2010-03-12 1440
2010-04-12 1440
2010-05-12 1440
2010-06-12 1440
2010-07-12 1440
2010-08-12 1440
2010-09-12 1440
2010-10-12 1440
2010-11-12 1440
2010-11-26 177
2010-11-27 1440
2010-11-28 1440
2010-11-29 1440
2010-11-30 1440

As many values are missing after November 26, 2010, only the values before November 27, 2010 are considered.

Interpolate missing values and save to parquet for easy retrieval¶
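A sketch of this step on a toy series, assuming time-based linear interpolation over a reindexed minute range (the parquet write is commented out, as it needs pyarrow or fastparquet installed):

```python
import numpy as np
import pandas as pd

# Minute series with a gap at 00:02, reindexed onto the full range and
# filled by time-based linear interpolation.
idx = pd.date_range("2010-01-01 00:00", periods=5, freq="min").delete(2)
s = pd.Series([1.0, 2.0, 4.0, 5.0], index=idx)

full = pd.date_range(s.index.min(), s.index.max(), freq="min")
filled = s.reindex(full).interpolate(method="time")
print(filled.tolist())  # → [1.0, 2.0, 3.0, 4.0, 5.0]
# filled.to_frame("Global_active_power").to_parquet("household.parquet")
```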

Sanity check: null values¶

As we preprocessed the dataset earlier, there should not be any.

Null values per column:
null_values null_value_percent
Global_active_power 0 0.0
Global_reactive_power 0 0.0
Voltage 0 0.0
Global_intensity 0 0.0
Sub_metering_1 0 0.0
Sub_metering_2 0 0.0
Sub_metering_3 0 0.0

Exploratory analysis¶

Descriptive statistics¶

The Voltage column is not very informative, as it is fairly constant. The sub-meterings and the global active power are on different scales. As I am going to predict the Global_active_power time series only (univariate), this is not an issue.

count mean std min 25% 50% 75% max
Global_active_power 2059419.0 1.089090 1.053636 0.076 0.310 0.61 1.526 11.122
Global_reactive_power 2059419.0 0.123539 0.112450 0.000 0.048 0.10 0.194 1.390
Voltage 2059419.0 240.827439 3.239969 223.200 238.980 241.00 242.870 254.150
Global_intensity 2059419.0 4.616927 4.428958 0.200 1.400 2.60 6.400 48.400
Sub_metering_1 2059419.0 1.109423 6.121396 0.000 0.000 0.00 0.000 88.000
Sub_metering_2 2059419.0 1.289165 5.793815 0.000 0.000 0.00 1.000 80.000
Sub_metering_3 2059419.0 6.418500 8.411376 0.000 0.000 1.00 17.000 31.000

Correlations¶

The main goal for this analysis is the forecast of Global_active_power, the overall power consumption of the household. The three sub-meterings are independent of each other, while Global_active_power is significantly correlated with them (especially with Sub_metering_3). Global_active_power is also correlated with Global_intensity: as power is the product of voltage (fairly constant) and current, this is expected. Voltage is an uninformative variable in this case.

Most correlated pairs:
Global_active_power  Global_intensity       0.998893
Global_intensity     Global_active_power    0.998893
Global_active_power  Sub_metering_3         0.638764
Sub_metering_3       Global_active_power    0.638764
                     Global_intensity       0.626822
Global_intensity     Sub_metering_3         0.626822
dtype: float64

Visualize Global_active_power distribution¶

The goal is to predict Global_active_power for the next months. We are going to compute aggregations on the dataset, since the temporal resolution of the dataset (minutes) is too fine for a long term prediction. Based on the goal of the analysis and the shape of the Global_active_power distribution, a choice must be made for the aggregation function that will be used.

Goal: compute the amount of power needed to size a solar panel system that would satisfy most of the household energy needs. Assumptions: sun is always shining, and all solar power from the panels is available to use. Shape of the distribution: bimodal with long tail on the right side.


Choice of aggregation function for Global_active_power¶

Daily aggregations¶

Weekly aggregations¶

Monthly aggregations¶

For this work, the monthly 99th percentile is chosen as the aggregation function (and hence as the predicted variable). Its distribution is approximately Gaussian, and it captures most of the monthly maximum usage needed to size the panel system. For the remaining 1% we can rely on the traditional electric grid.

Resampling Global_active_power on a monthly scale¶

Exclude data before July 2007, as it has relatively high magnitude with respect to the remaining data.
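The resampling step can be sketched with synthetic minute-level data (the gamma distribution and its parameters are illustrative stand-ins, not fitted to the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Two months of synthetic minute-level power readings (kW).
idx = pd.date_range("2008-01-01", "2008-02-29 23:59", freq="min")
power = pd.Series(rng.gamma(2.0, 0.5, len(idx)), index=idx)

# Monthly 99th percentile: the aggregation chosen as the forecast target.
monthly_p99 = power.resample("MS").quantile(0.99)
print(monthly_p99)
```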

Investigate stationarity¶

The data is non-stationary: the mean value is not constant, there is a negative trend present, and seasonality is visible. It is not clear whether there is heteroscedasticity. The data is autocorrelated as well.

Statistics on different sections of data¶

Splitting the monthly data into 10 chunks, we can see that the variance across chunks is not constant (a sign of non-stationarity).

mean var
0 5.509652 0.152727
1 4.780756 0.026575
2 5.313856 0.184729
3 4.711944 0.071158
4 4.840024 0.087466
5 4.632652 0.213626
6 4.267048 0.125345
7 4.952680 0.094290
8 3.892915 0.079233
9 4.246335 0.113754

Visual inspection (histogram)¶

Augmented Dickey-Fuller test¶

According to the Augmented Dickey-Fuller test, the series is stationary. However, both seasonality and trend are clearly visible, which underlines the importance of visual inspection. A different choice of lag in the ADF test function would likely give a different result.

Test statistic: -3.230, pvalue: 0.018
Critical values: {'1%': -3.5778480370438146, '5%': -2.925338105429433, '10%': -2.6007735310095064}
Observations: 47, Used Lag: 0

Autocorrelation¶

Differencing order¶

If we difference the data (take the difference between each point and the next) and analyze the resulting series, the result does not show any pattern, and the autocorrelation and partial autocorrelation values are all within the bounds of statistical non-significance. The first lag in the ACF is negative, but not below -0.5. This means that the series does not need any further differencing, and the differencing order d is 1. In general, d=1 means that the original data presents a constant average trend, which is the case for the data under examination.

Standard deviation (original time series): 0.57370
Standard deviation (diff=1): 0.40261
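The comparison of standard deviations can be sketched as follows, on a synthetic trending series (slope and noise level are illustrative): differencing removes the trend, so the differenced series has a lower standard deviation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Linear downward trend (5.5 -> 4.0) plus mild noise, loosely shaped
# like the monthly series under analysis.
series = pd.Series(np.linspace(5.5, 4.0, 48) + rng.normal(0, 0.15, 48))

print(f"std original: {series.std():.3f}")
print(f"std diff=1:   {series.diff().std():.3f}")  # lower: trend removed
```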

Autocorrelation Function (ACF)¶

The ACF measures the correlation between the current value of the series and its lagged values. In the ACF plot, the values alternate between successive lags. In this case, it is likely that only the first lag is significant.

Partial autocorrelation function (PACF)¶

There is a sharp cutoff in the PACF (while the ACF decays more slowly), so the series can be well represented by an AR model. Only the previous value (lag=1) appears to be significant for the prediction, which translates into an AR(1) model. However, an AR(1) model with a coefficient close to 1 behaves like a first difference (d=1, as seen above), so the original series is likely under-differenced.

Seasonal subseries plot (monthly)¶

There is a clear seasonal effect, with lower power consumption in summer months.

Modeling: Additive decomposition of monthly data¶

Use statsmodels.tsa.seasonal.seasonal_decompose to analyze the time series with an additive model with a period of 12 months. The seasonality pattern in the data is captured very well, and the same holds for the trend. The residuals should be white noise, and they can be tested for stationarity with the Augmented Dickey-Fuller test.

Residual distribution and Augmented Dickey-Fuller test¶

The residuals are normally distributed, and the Augmented Dickey-Fuller test indicates a stationary series.

Test statistic: -4.285, pvalue: 0.00047
Critical values: {'1%': -3.661428725118324, '5%': -2.960525341210433, '10%': -2.6193188033298647}
Observations: 31, Used Lag: 4

Modeling: Triple Exponential Smoothing¶

As we saw that the data presents both trend and seasonality, triple exponential smoothing is a more appropriate model than single or double exponential smoothing. In the plot, the solid blue line is the training data, the dashed line is the test data, and the green line represents the six-month prediction on the test set.

Mean squared error: 0.06710, Mean abs percentage error: 0.05696

Modeling: ARIMA¶

The choice of an autoregressive (AR) model (of order p) and/or a moving average (MA) model (of order q) can be carried out using the Box-Jenkins Methodology. From ACF and PACF plots:

  • ARIMA(p,d,0) model: the ACF of the differenced data is exponentially decaying or sinusoidal, and there is a significant spike at lag $p$ in the PACF, but none beyond lag $p$
  • ARIMA(0,d,q) model: the PACF of the differenced data is exponentially decaying or sinusoidal, and there is a significant spike at lag $q$ in the ACF, but none beyond lag $q$

It is estimated that for my data an AR model with $p=1$ is appropriate.

ARMA models include an autoregressive (AR) part and a moving average (MA) part. As the data is likely well represented by an ARMA(p,0) model, I am going to start from an AR(1) model.

SARIMAX Results
Dep. Variable: Global_active_power No. Observations: 48
Model: ARIMA(1, 0, 0) Log Likelihood -22.163
Date: Sat, 22 Jan 2022 AIC 50.325
Time: 12:03:56 BIC 55.939
Sample: 12-31-2006 HQIC 52.447
- 11-30-2010
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
const 4.8085 0.203 23.741 0.000 4.411 5.205
ar.L1 0.7749 0.087 8.920 0.000 0.605 0.945
sigma2 0.1446 0.033 4.450 0.000 0.081 0.208
Ljung-Box (L1) (Q): 0.20 Jarque-Bera (JB): 0.21
Prob(Q): 0.65 Prob(JB): 0.90
Heteroskedasticity (H): 1.11 Skew: -0.04
Prob(H) (two-sided): 0.84 Kurtosis: 2.68


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Residuals from ARMA model¶

Residuals from the ARMA model look normally distributed and show no auto-correlation.

ARMA forecasting¶

Mean squared error: 0.21626, Mean abs percentage error: 0.10408

Modeling: SARIMA¶

With a SARIMA model we can introduce (over the ARMA model) a seasonal component, with its own (P,D,Q,s) parameters.

Autocorrelation of seasonal component¶

For SARIMA models, it is important to analyze the seasonal component alone for auto-correlation. It seems that only the first seasonal lag is important for prediction, suggesting a seasonal AR(1) term for the seasonal component. Once again, this is confirmed by differencing the seasonal time series (12-month difference).

With differencing:

SARIMA fitting¶

The parameters for the fit:

  • p = 1, d = 0, q = 0
  • P = 0, D = 1, Q = 0, s = 12

A linear trend term is added (trend='t') to account for the linear trend (this could also be achieved with d=1).

Mean squared error: 0.06183, Mean abs percentage error: 0.05252
SARIMAX Results
Dep. Variable: Global_active_power No. Observations: 42
Model: SARIMAX(1, 0, 0)x(1, 1, 0, 12) Log Likelihood -8.349
Date: Sat, 22 Jan 2022 AIC 24.697
Time: 12:03:59 BIC 30.302
Sample: 12-31-2006 HQIC 26.490
- 05-31-2010
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
drift -0.0095 0.004 -2.117 0.034 -0.018 -0.001
ar.L1 0.2339 0.231 1.013 0.311 -0.219 0.686
ar.S.L12 -0.4719 0.299 -1.577 0.115 -1.059 0.115
sigma2 0.0922 0.025 3.742 0.000 0.044 0.140
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 3.29
Prob(Q): 0.95 Prob(JB): 0.19
Heteroskedasticity (H): 2.07 Skew: 0.72
Prob(H) (two-sided): 0.27 Kurtosis: 3.73


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

SARIMA forecast¶

SARIMA six months forecast¶

Modeling: Auto ARIMA¶

Using the pmdarima package.

Auto ARIMA fitting¶

Performing stepwise search to minimize aic
 ARIMA(1,0,1)(0,1,1)[12] intercept   : AIC=inf, Time=0.25 sec
 ARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=22.539, Time=0.01 sec
 ARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=20.619, Time=0.10 sec
 ARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=inf, Time=0.16 sec
 ARIMA(0,0,0)(0,1,0)[12]             : AIC=35.875, Time=0.01 sec
 ARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=24.103, Time=0.02 sec
 ARIMA(1,0,0)(2,1,0)[12] intercept   : AIC=19.572, Time=0.28 sec
 ARIMA(1,0,0)(2,1,1)[12] intercept   : AIC=21.572, Time=0.29 sec
 ARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=inf, Time=0.27 sec
 ARIMA(0,0,0)(2,1,0)[12] intercept   : AIC=17.646, Time=0.20 sec
 ARIMA(0,0,0)(1,1,0)[12] intercept   : AIC=18.920, Time=0.05 sec
 ARIMA(0,0,0)(2,1,1)[12] intercept   : AIC=19.646, Time=0.22 sec
 ARIMA(0,0,0)(1,1,1)[12] intercept   : AIC=inf, Time=0.31 sec
 ARIMA(0,0,1)(2,1,0)[12] intercept   : AIC=19.452, Time=0.27 sec
 ARIMA(1,0,1)(2,1,0)[12] intercept   : AIC=inf, Time=0.57 sec
 ARIMA(0,0,0)(2,1,0)[12]             : AIC=39.479, Time=0.10 sec

Best model:  ARIMA(0,0,0)(2,1,0)[12] intercept
Total fit time: 3.109 seconds
SARIMAX Results
Dep. Variable: y No. Observations: 42
Model: SARIMAX(2, 1, 0, 12) Log Likelihood -4.823
Date: Sat, 22 Jan 2022 AIC 17.646
Time: 12:04:04 BIC 23.251
Sample: 0 HQIC 19.439
- 42
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -0.6674 0.133 -4.999 0.000 -0.929 -0.406
ar.S.L12 -0.7443 0.335 -2.219 0.026 -1.402 -0.087
ar.S.L24 -0.6821 0.512 -1.332 0.183 -1.686 0.322
sigma2 0.0449 0.047 0.962 0.336 -0.047 0.136
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 2.12
Prob(Q): 0.97 Prob(JB): 0.35
Heteroskedasticity (H): 1.99 Skew: 0.62
Prob(H) (two-sided): 0.29 Kurtosis: 3.41


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
auto-fit order: (0, 0, 0)
auto-fit seasonal_order: (2, 1, 0, 12)
Mean squared error: 0.08415, Mean abs percentage error: 0.06593

Auto ARIMA forecast¶

ARIMA(order=(0, 0, 0), scoring_args={}, seasonal_order=(2, 1, 0, 12),
      suppress_warnings=True)

Auto ARIMA six months forecast¶

Cross-validation split and model performance evaluation¶

To compare models, it is best to use a cross-validation split rather than just the performance on a single test set. In time series, cross-validation works differently, as we need to take the sequence of events into account. I will use the TimeSeriesSplit function of scikit-learn and compare the performance on 3 splits with a test size of 3, as not many data points are available.

Train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38]
Test: [39 40 41]
Train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41]
Test: [42 43 44]
Train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44]
Test: [45 46 47]
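The splits shown above can be reproduced with scikit-learn's TimeSeriesSplit:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.arange(48)  # stand-in for the 48 monthly observations

# Expanding-window split: each fold trains on everything before its test block.
tscv = TimeSeriesSplit(n_splits=3, test_size=3)
for train_idx, test_idx in tscv.split(data):
    print(f"Train: 0..{train_idx[-1]}  Test: {test_idx.tolist()}")
```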

The three models (Triple Exponential Smoothing, SARIMA (1,0,0)x(1,1,0,12), ARIMA (0,0,0)x(2,1,0,12)) offer similar performance on the cross-validated test sets. The ARIMA model has the lowest performance (maybe due to the automatic fit with few data points available), followed by the exponential smoothing, and finally the SARIMA model. It is remarkable that a relatively simple Triple Exponential Smoothing model is able to compete with the SARIMA model in this case study.

date real sarima auto_arima triple_smoothing
0 2010-03-31 4.51406 4.389967 4.494366 4.405448
1 2010-04-30 4.01282 4.116398 4.208333 4.192382
2 2010-05-31 4.11602 4.042717 4.075042 4.139283
3 2010-06-30 4.03282 3.743101 3.732752 3.830997
4 2010-07-31 3.41000 3.623548 3.828459 3.770710
5 2010-08-31 3.97004 3.870919 3.934475 3.949263
6 2010-09-30 3.87928 3.885768 4.104954 4.060887
7 2010-10-31 4.70402 4.342627 4.483800 4.492324
8 2010-11-30 4.43200 4.713179 4.727936 4.816216
MAPE:
Triple Smoothing: 0.046
Auto ARIMA: 0.049
SARIMA: 0.041

MSE:
Triple Smoothing: 0.049
Auto ARIMA: 0.055
SARIMA: 0.042

Six-month future forecast from three models¶

The three models give similar predictions, with the same trend of decreasing demand moving towards summer. The SARIMAX model is the least conservative, predicting a lower amount of power needed with respect to the other models.

ARIMA(order=(0, 0, 0), seasonal_order=(2, 1, 0, 12))

Findings (analysis)¶

  • The 99th percentile of the monthly power consumption (99MPC) was analyzed to size a hypothetical solar panel system able to satisfy the power needs of the household in most situations, minimizing grid reliance
  • The 99MPC time series shows a decreasing trend. It is unclear if this is due to change of habits or replacement of some power-hungry appliances. The analysis of the three sub-meterings (not provided here) could offer more insights into the reasons
  • As the 99MPC decreased in time, sizing the system on past power demand peaks could satisfy the future needs. For example, a 5.5 kW system would suffice
  • The sizing does not take into account the addition of big appliances or other power hungry devices, such as kitchen electric plates or an EV charging station
  • It is unlikely that the general trend will keep decreasing in the future with the same appliances. Rather, the power consumption trend would probably dampen and stabilize to a constant value baseline
  • The future prediction for the next six months does not provide significant insights for the solar panels sizing

Findings and suggestions (models)¶

  • The 99th percentile of the monthly power consumption was analyzed as the metric, as it is both a well-behaved metric, and it captures the functional need of the analysis (the sizing of a solar panels system should be based on covering up to a fraction of the household power needs, in this case roughly 99%)
  • The 99MPC time series shows trend, seasonality, and autocorrelation
  • The decomposition of trend, seasonal, and residual components of the time series using an additive model is very successful in capturing the dynamic of the 99MPC in time
  • As the time series presents both seasonal and trend components, single or double exponential smoothing are not the most suitable techniques to model the series. A triple exponential smoothing model is selected as the first tested model
  • The triple exponential smoothing model, trained on the full time series excluding the last six data points (test set), is validated against the test set. The performance of this rather simple model is remarkable. It is likely due to the well-behaved 99MPC metric, that does not present significant noise as it is averaged on many data points (a month worth) and it considers a stable statistics (99th percentile) rather than the mean value
  • From the analysis of the autocorrelation function plot and the partial autocorrelation function plot, an AR(1) model is selected for the second model (ARMA-based)
  • The prediction of the AR(1) model is not satisfactory, as there is an unaccounted for seasonality term in the residuals
  • A SARIMA(1,0,0)x(1,1,0,12) is tested next. The parameters for the seasonal component are derived from the analysis of the seasonal component alone, obtained from the seasonal decomposition of the time series
  • The SARIMA model performs better on this test set, as it can model the seasonality of the series
  • Another seasonal ARIMA model is fit next. This time, the coefficients are selected automatically (through minimization of AIC). The result is a (0,0,0)x(2,1,0,12) model. It is effectively a SARIMA model, but it will be referred to as ARIMA to avoid confusion with the previous SARIMA model
  • The performance of this ARIMA model is very similar to the previous SARIMA model
  • Cross-validation is used to rank the three promising models (triple exponential smoothing, manual SARIMA, automatic ARIMA) on three data splits with test sample size of three data points. The number of splits is only three because only 48 monthly data points were available in total
  • The ranking result: SARIMA, Triple Exponential Smoothing, ARIMA. The results are however very similar between the three models
  • The selected model for future analysis is the SARIMA model
  • An RNN or LSTM model could be implemented, however more data would be needed to train the models to achieve a performance similar to the SARIMA model
  • The training of the neural network models could also take advantage of the three sub-meterings time series to provide more accurate results, but it is still subject to the limitation of the low number of datapoints. Therefore, RNN and LSTM are not considered in this report

Conclusions¶

Several time series modeling techniques were chosen to model the monthly 99th percentile of Global_active_power. A SARIMA model was chosen as the best approximation for predicting the future behavior of the series, capturing both the long-term trend and the seasonality. The 99MPC is decreasing in time, and 5.5 kW is the selected solar panel system size (under the notable assumptions of the sun always shining and all solar panel power being available to the appliances). Given the decreasing trend of the series, a six-month forecast is not decisive in determining the size of the solar power system.